Abstract

Current skyline evaluation techniques assume a fixed or- 
dering on the attributes. However, dynamic preferences on 
nominal attributes are more realistic in known applications. 
In order to generate online response for any such preference 
issued by a user, we propose two methods of different char- 
acteristics. The first one is a semi-materialization method 
and the second is an adaptive SFS method. Finally, we con- 
duct experiments to show the efficiency of our proposed al- 
gorithms. 



1 Introduction 

The skyline operator has emerged as an important summarization
technique for multi-dimensional datasets. Given a set of
m-dimensional data points, the skyline S is the set of all points p
such that there is no other point q which dominates p. q is said to
dominate p if q is better than p in at least one dimension and not
worse than p in all other dimensions. Consider a customer looking for
a vacation package to Cancun using three criteria: price, hotel-class
and number of stops. We know that a lower price, a higher hotel class
and fewer stops are preferable. Thus, if p is in the skyline, then
there is no other package q which has a lower price, a higher hotel
class and fewer stops compared with p.

Skyline queries have been studied since the 1960s in the theory
field, where skyline points are known as Pareto sets and admissible
points [10] or maximal vectors [9]. However, earlier algorithms such
as [9, 8] are inefficient when there are many data points in a
high-dimensional space. The problem of skyline queries was introduced
in the database context in [1].

Most of the existing studies handle only numeric attributes. Consider
the example in Table 1, which shows a set of vacation packages with
three attributes, or dimensions: Price, Hotel-class and Hotel-group.
(In this paper, we use the terms "attribute" and "dimension"
interchangeably.) Most existing works consider the first two
attributes, which are numeric, where a lower price and a higher
hotel-class are more preferable. Many efficient methods have been
proposed for so-called full-space skyline queries, which return a set
of skyline points in a specific space (a set of dimensions such as
price and hotel-class). Some representative methods include a block
nested loop (BNL) algorithm [1], a sort first skyline (SFS) algorithm
[7], a bitmap method [19], a nearest neighbor (NN) algorithm [13] and
a branch and bound skyline (BBS) method [14, 15]. Recently, skyline
computation has been extended to consider subspace skyline queries,
which return the skylines in subspaces [23, 17, 22, 18, 16].

Hotel-group as shown in Table 1 is a categorical at- 
tribute. There can be partial ordering on categorical at- 
tributes. Some recent studies [3, 2, 4, 6, 5, 12, 11, 20] con- 
sider partially-ordered categorical attributes. In [3, 2], each 
partially-ordered attribute is transformed into two-integer 
attributes such that the conventional skyline algorithms can 
be applied. [4] studies the cost estimation of the skyline 
operator involving the partially ordered attributes. 

Nevertheless, known existing work on categorical at- 
tributes assumes that each attribute has only one order: 
either a total or a partial order. In real life, it is not of- 
ten that categorical attributes have a fixed predefined order. 
For example, different customers may prefer different re- 
alty locations, different car models, or different airlines. We 
call such a categorical attribute which does not come with 
a predefined order a nominal attribute. It is easy to name 
important applications with nominal attributes, such as re- 
alties (where type of realty, regions and style are examples 
of nominal attributes) and flight booking (where airline and 
transition airport are examples of nominal attributes). In 
this paper, we consider scenarios where different users
may have different preferences on nominal attributes. That
is, more than one order needs to be considered for each
nominal attribute.

Furthermore, for a nominal attribute there may typically be many
different values, and a user would not specify an order on all the
values, but would only list a few of the most favorite choices.
Table 2 shows different customer preferences on Hotel-group. The
preference of Alice is "$T \prec M \prec *$", which means that she
prefers Tulips to Mozilla and prefers these two to all other hotel
groups (i.e., Horizon). We call such preferences implicit
preferences. Note that different preferences yield different
skylines. As shown in Table 2, the skyline is {a, c} for Alice's
preference but {a, c, e, f} for Fred's preference. The numerous
skylines make the problem highly challenging.

Package   Price   Hotel-class   Hotel-group
a         1600    4             T (Tulips)
b         2400    1             T (Tulips)
c         3000    5             H (Horizon)
d         3600    4             H (Horizon)
e         2400    2             M (Mozilla)
f         3000    3             M (Mozilla)

Table 1. Vacation packages

Customer   Preference                 Skyline
Alice      $T \prec M \prec *$        {a, c}
Bob        No special preference      {a, c, e, f}
Chris      $H \prec M \prec *$        {a, c, e}
David      $H \prec M \prec T$        {a, c, e}
Emily      $H \prec T \prec *$        {a, c}
Fred       $M \prec *$                {a, c, e, f}

Table 2. Customer preferences

Some recent works [6, 5] study the problem of preference changes,
whereupon the query results can be incrementally refined. In [12], a
user or a customer can specify some values in nominal attributes as
an equivalence class to denote the same "importance" for those
values. [11] is an extension of [12]. In [11], whenever a user finds
that there are a lot of irrelevant results for a query, s/he can
modify the query by adding more conditions so that the result set is
smaller to suit her/his need. However, these works focus only on the
effects of query changes on the result size, or on the reuse of
skyline results when a query is refined in a progressive manner, but
not on finding efficient algorithms. Here, we consider that different
users may have different preferences, so the preferences are not
undergoing refinement but can differ or conflict from one query to
another. Also, we focus on the issue of efficient query answering.
Nominal attributes are first considered in [20], but the study there
is about finding a set of partial orders with respect to which a
given point is in the skyline.

In [15], a dynamic skyline is considered, but only for numeric data,
and the "dynamic function" considered is based on distance from a
user location. Here, we consider nominal attributes, and the "dynamic
function" is any mapping between the nominal values and the rankings
where each nominal value is assigned a ranking value. The BBS method
does not work in our case.

Our contributions include the following. (1) To the best of our
knowledge, this is the first work to study the problem of efficient
skyline querying with respect to dynamic implicit preferences on
nominal attributes. (2) We propose two efficient algorithms of
different flavors, namely IPO-Tree Search and Adaptive SFS. The
IPO-tree is a partial materialization of the skylines for all
possible implicit preferences. It facilitates the efficient
computation of the skyline for any implicit preference. Adaptive SFS
is a little slower, but it does not require materialization and has
the nice properties of being progressive and allowing incremental
maintenance. (3) We have conducted extensive experiments to show the
efficiency of our proposed algorithms.


2 Problem Definition 

A skyline analysis involves multiple attributes. A user's preference
on the values in an attribute can be modeled by a partial order on
the attribute. A partial order $\preceq$ is a reflexive,
antisymmetric and transitive relation. A partial order is also a
total order if, for any two values $u$ and $v$ in the domain, either
$u \preceq v$ or $v \preceq u$. We write $u \prec v$ if
$u \preceq v$ and $u \neq v$. A partial order can also be written as
$R = \{(u, v) \mid u \preceq v\}$, and $u \preceq v$ can also be
written as $(u, v) \in R$. We call this model the partial order
model.

By default, we consider points in an m-dimensional space
$\mathbb{S} = D_1 \times \cdots \times D_m$. For each dimension
$D_i$, we assume that there is a partial or total order $R_i$ on the
values in $D_i$. For a point $p$, $p.D_i$ is the projection on
dimension $D_i$. If $(p.D_i, q.D_i) \in R_i$, we also write
$p.D_i \preceq q.D_i$.

For points $p$ and $q$, $p$ dominates $q$, denoted by $p \prec q$,
if, for any dimension $D_i \in \mathbb{S}$, $p.D_i \preceq q.D_i$,
and there exists a dimension $D_{i_0} \in \mathbb{S}$ such that
$p.D_{i_0} \prec q.D_{i_0}$. If $p$ dominates $q$, then $p$ is more
preferable than $q$ according to the preference orders. The dominance
relation $R$ can be viewed as the integration of the preference
partial orders on all dimensions. Thus, we can write
$R = (R_1, \ldots, R_m)$. It is easy to see that the dominance
relation is a strict partial order.

Given a data set $\mathcal{D}$ containing data points in space
$\mathbb{S}$, a point $p \in \mathcal{D}$ is in the skyline of
$\mathcal{D}$ (i.e., a skyline point in $\mathcal{D}$) if $p$ is not
dominated by any point in $\mathcal{D}$. Given a preference $R$, the
skyline of $\mathcal{D}$, denoted by $SKY(R)$, is the set of skyline
points in $\mathcal{D}$.
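
The dominance and skyline definitions above can be sketched directly
as a brute-force check. The following Python fragment is only an
illustration on the Table 1 data, not one of the algorithms proposed
in this paper; the names PACKAGES, no_worse, dominates and skyline
are hypothetical, and the order on Hotel-group is passed as a set of
pairs $(u, v)$ meaning $u \prec v$.

```python
# Table 1: name -> (price, hotel_class, hotel_group)
PACKAGES = {
    "a": (1600, 4, "T"), "b": (2400, 1, "T"), "c": (3000, 5, "H"),
    "d": (3600, 4, "H"), "e": (2400, 2, "M"), "f": (3000, 3, "M"),
}


def no_worse(p, q, group_order):
    """True if p is at least as preferable as q on every dimension."""
    return (p[0] <= q[0]                      # lower price is better
            and p[1] >= q[1]                  # higher hotel-class is better
            and (p[2] == q[2] or (p[2], q[2]) in group_order))


def dominates(p, q, group_order):
    """p dominates q: no worse everywhere and different somewhere."""
    return p != q and no_worse(p, q, group_order)


def skyline(points, group_order):
    """Names of the points that no other point dominates."""
    return {name for name, q in points.items()
            if not any(dominates(p, q, group_order) for p in points.values())}


# Bob: no order on Hotel-group. Alice: T before M before H.
bob = skyline(PACKAGES, set())
alice = skyline(PACKAGES, {("T", "M"), ("T", "H"), ("M", "H")})
```

Under these assumptions, bob and alice reproduce the corresponding
rows of Table 2. The quadratic scan is meant only to make the
definition concrete.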

In many applications, there often exist some orders on some of the
dimensions that hold for all users. In our example in Table 1, a
lower price and a higher hotel-class are always preferred by
customers. Even for nominal attributes, there may exist some
universal partial orders. Hence, we assume that we are given a
template, which contains a partial order for every dimension. The
partial orders in the template are applicable to all users. Each user
can then express his/her specific preference by refining the
template. The containment relation of orders captures the refinement.

For partial orders $R$ and $R'$, $R'$ is a refinement of $R$, denoted
by $R \subseteq R'$, if for any $(u, v) \in R$, $(u, v) \in R'$.
Moreover, if $R \subseteq R'$ and $R \neq R'$, $R'$ is said to be
stronger than $R$. Let $R = \{(T, M)\}$ and
$R' = \{(T, M), (H, M)\}$. Then, $R \subseteq R'$. That is, $R'$ is a
refinement of $R$ by adding a preference $H \prec M$. As
$R \neq R'$, $R'$ is stronger than $R$.

Property 1 For orders $R = (R_1, \ldots, R_m)$ and
$R' = (R'_1, \ldots, R'_m)$, $R \subseteq R'$ if and only if
$R_i \subseteq R'_i$ for $1 \le i \le m$.

Theorem 1 (Monotonicity) ([20]) Given a data set $\mathcal{D}$ and a
template $R$, if $p$ is not in the skyline with respect to $R$, then
$p$ is not in the skyline with respect to any refinement $R'$ of $R$.

Theorem 1 indicates that, when the orders on the dimen- 
sions are strengthened, some skyline points may be disqual- 
ified. However, a non-skyline point never gains the skyline 
membership due to a stronger order. This monotonic prop- 
erty greatly helps in analyzing skylines with respect to var- 
ious orders. 

Definition 1 (Conflict-free) ([20]) Let $R$ and $R'$ be two partial
orders. $R$ and $R'$ are conflict-free if there exist no values $u$
and $v$ such that $u \neq v$, $(u, v) \in R$, and $(v, u) \in R'$.
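
When partial orders are stored as sets of pairs, the refinement and
conflict-free relations reduce to simple set tests. The sketch below
assumes that representation; the function names are hypothetical and
are not taken from [20].

```python
def is_refinement(r, r_prime):
    """R' refines R when every pair of R also appears in R'."""
    return r <= r_prime


def is_stronger(r, r_prime):
    """R' is stronger than R when it refines R and differs from it."""
    return r < r_prime


def conflict_free(r, r_prime):
    """No two distinct values are ordered one way in R and the other way in R'."""
    return not any(u != v and (v, u) in r_prime for (u, v) in r)


R = {("T", "M")}
R_PRIME = {("T", "M"), ("H", "M")}
```

Here R and R_PRIME are the two orders of the example above:
is_refinement(R, R_PRIME) and is_stronger(R, R_PRIME) both hold, and
the two orders are conflict-free.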

Although the model of partial order refinements can model diverse
individual preferences, it does not fit real-world scenarios tightly.
In a skyline query, for a nominal attribute, users typically would
not explicitly order all values, but may specify a few of their
favorite choices and also give them an ordering. For example, a user
may specify that the first choice is $v$ and the second choice is
$v'$. The implicit meaning is that $v$ and $v'$ are better than all
the other choices, say $v_1, v_2, \ldots, v_k$. We can model this by
the partial order model, by including $v \prec v'$, $v \prec v_1$,
$v \prec v_2$, ..., $v \prec v_k$ and $v' \prec v_1$,
$v' \prec v_2$, ..., $v' \prec v_k$. We denote this preference by
"$v \prec v' \prec *$", where $*$ means all choices other than $v$
and $v'$ (in this case, $*$ corresponds to
$\{v_1, v_2, \ldots, v_k\}$). We call this special kind of partial
order an implicit preference and assume that it is represented in
such a form. For example, the implicit preference
"$H \prec M \prec *$" corresponds to the set of binary orders
$\{(H, M), (H, T), (M, T)\}$ in the partial order model.

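Definition 2 (Implicit Preferences) Let $v_1, v_2, \ldots, v_k$ be
all the values in a nominal attribute $D_i$. An implicit preference
$\tilde{R}_i$ on $D_i$ is given by
$v_1 \prec v_2 \prec \cdots \prec v_x \prec *$. It is equivalent to
the partial order given by
$\{(v_i, v_j) \mid i < j \wedge i \in [1, x] \wedge j \in [1, k]\}$.

In the above definition, $\tilde{R}_i$ is said to be an $x$-th order
implicit preference. Also, the order of $\tilde{R}_i$, denoted by
$order(\tilde{R}_i)$, is defined to be $x$ and the order of
$\tilde{R}$ is defined to be $\max_i \{order(\tilde{R}_i)\}$. A value
$v_j$ is said to be in $\tilde{R}_i$ if
$v_j \in \{v_1, v_2, \ldots, v_x\}$. Also, $v_j$ is said to be the
$j$-th entry in $\tilde{R}_i$. $\mathcal{P}(\tilde{R}_i)$ is defined
to be
$\{(v_i, v_j) \mid i < j \text{ and } i \in [1, x] \text{ and } j \in [1, k]\}$.
Let $\tilde{R}' = (\tilde{R}'_1, \ldots, \tilde{R}'_m)$.
$\mathcal{P}(\tilde{R}')$ is defined to be
$\bigcup_{i=1}^{m} \mathcal{P}(\tilde{R}'_i)$.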
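
The expansion of an implicit preference into its set of binary
orders can be sketched as follows. The function name expand_implicit
and the list-based encoding (listed values in order of preference,
everything else standing for $*$) are assumptions made for this
illustration.

```python
def expand_implicit(listed, domain):
    """Pairs (u, v) meaning u is preferred to v, for 'listed[0] < listed[1] < ... < *'."""
    rest = [v for v in domain if v not in listed]
    pairs = set()
    for i, u in enumerate(listed):
        for v in listed[i + 1:]:      # u precedes every later listed value
            pairs.add((u, v))
        for v in rest:                # u precedes every unlisted value
            pairs.add((u, v))
    return pairs


HOTEL_GROUPS = ["T", "H", "M"]
h_then_m = expand_implicit(["H", "M"], HOTEL_GROUPS)
```

With the three hotel groups of Table 1, h_then_m is the set
{(H, M), (H, T), (M, T)} given in the text for the implicit
preference that lists H and then M.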



Package   Price   Hotel-class   Hotel-group   Airline
a         1600    4             T (Tulips)    G (Gonna)
b         2400    1             T (Tulips)    G (Gonna)
c         3000    5             H (Horizon)   G (Gonna)
d         3600    4             H (Horizon)   R (Redish)
e         2400    2             M (Mozilla)   R (Redish)
f         3000    3             M (Mozilla)   W (Wings)

Table 3. A table with two nominal attributes.



In this paper, we adopt the convention that $\tilde{R}'$ denotes an
implicit preference and $R'$ denotes a partial order (which may or
may not be an implicit preference). Also, we denote
$SKY(\mathcal{P}(\tilde{R}'))$ by $SKY(\tilde{R}')$.

Definition 3 (Problem) Given a dataset $\mathcal{D}$ and an implicit
preference $\tilde{R}'$, find the skyline $SKY(\tilde{R}')$ in
$\mathcal{D}$.

The problem defined above is our objective in this paper. We also say
that we want to find a set of skyline points with respect to
$\tilde{R}'$ in $\mathcal{D}$. In many applications, online response
is required. The extensive study in [15] reports that all the
existing algorithms have some serious shortcomings, and a new
algorithm, BBS, is proposed which is much more efficient than
previous methods. However, the data partitioning in BBS is based on
fixed orderings on the dimensions, and the same partitioning cannot
be used for dynamic or variable preferences on nominal attributes.
Therefore, new mechanisms need to be explored.

The problem of dynamic implicit preferences has a similar flavor to
subspace skylines, since materialization of the possible skylines
seems to be a solution. However, as noted in [15], most applications
involve up to five attributes, so the dimensionality of a typical
skyline problem is not high; materialization of the skylines is
therefore quite feasible and has been investigated in recent works
such as [23, 22, 18, 16]. For dynamic implicit preferences, the
number of combinations is exponential not only in the dimensionality
but also in the cardinalities of the attributes, which makes the
problem much more challenging.

3 Partial Materialization: IPO-Tree Search 

In order to support online response, a naive approach is to
materialize the skylines for all possible preferences. However, as
noted above, this approach is very costly in storage and
preprocessing. Our study in [21] shows that, even with an index and
with compression by removing redundancies in shared skylines, the
cost is still prohibitive.

Our idea is therefore to materialize some useful partial results so
that these partial results can be combined efficiently to form the
query results. In particular, we propose to materialize the results
with respect to the first-order implicit preference on each nominal
attribute only. Since results for the second or higher order
preferences are not stored, the number of combinations is
significantly reduced. In the following, we describe an important
property called the merging property, which allows us to derive the
results of all possible implicit preferences of any order by simple
operations on top of the first-order information maintained.

Figure 1. Illustration of the merging property. $\tilde{R}'$:
$M \prec *$ with $SKY_1 = \{a, c, e, f\}$ and $PSKY_1 = \{e, f\}$;
$\tilde{R}''$: $H \prec *$ with $SKY_2 = \{a, c, e\}$;
$\tilde{R}'''$: $M \prec H \prec *$ with
$SKY_3 = (SKY_1 \cap SKY_2) \cup PSKY_1 = \{a, c, e, f\}$.

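Theorem 2 (Merging Property) Let two implicit preferences
$\tilde{R}'$ and $\tilde{R}''$ differ only at the $i$-th dimension,
i.e., $\tilde{R}'_j = \tilde{R}''_j$ for all $j \neq i$. Furthermore,
$\tilde{R}'_i$ = "$v_1 \prec \cdots \prec v_{x-1} \prec *$" and
$\tilde{R}''_i$ = "$v_x \prec *$". Let $PSKY(\tilde{R}')$ be the set
of points in $SKY(\tilde{R}')$ with $D_i$ values in
$\{v_1, \ldots, v_{x-1}\}$. Let $\tilde{R}'''_i$ =
"$v_1 \prec \cdots \prec v_{x-1} \prec v_x \prec *$". The skyline
with respect to $\tilde{R}'''$ is
$(SKY(\tilde{R}') \cap SKY(\tilde{R}'')) \cup PSKY(\tilde{R}')$.

Proof: A proof is given in the Appendix.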

For example, in Figure 1, let $\tilde{R}'$ be "$M \prec *$" and
$\tilde{R}''$ be "$H \prec *$". From Table 1, the skyline with
respect to $\tilde{R}'$ is $SKY_1 = \{a, c, e, f\}$ and the skyline
with respect to $\tilde{R}''$ is $SKY_2 = \{a, c, e\}$.
$PSKY_1 = \{e, f\}$ is the set of skyline points in $SKY_1$ with
values in $\{M\}$. Let $\tilde{R}'''$ be "$M \prec H \prec *$". By
Theorem 2, the skyline $SKY_3$ with respect to $\tilde{R}'''$ is
obtained as follows:
$SKY_3 = (SKY_1 \cap SKY_2) \cup PSKY_1
= (\{a, c, e, f\} \cap \{a, c, e\}) \cup \{e, f\}
= \{a, c, e\} \cup \{e, f\} = \{a, c, e, f\}$.
The derivation can be explained as follows.
$\mathcal{P}(\tilde{R}')$ and $\mathcal{P}(\tilde{R}'')$ are not
conflict-free because their union contains both $(M, H)$ and
$(H, M)$. In fact, the only difference between
$\mathcal{P}(\tilde{R}') \cup \mathcal{P}(\tilde{R}'')$ and
$\mathcal{P}(\tilde{R}''')$ is that
$\mathcal{P}(\tilde{R}') \cup \mathcal{P}(\tilde{R}'')$ contains one
more binary entry, namely $(H, M)$, which may disqualify some data
points (in this example, it disqualifies $f$). In order to remove the
disqualifying effect, we augment the intersection
$SKY_1 \cap SKY_2$ by a union with $PSKY_1$, where $PSKY_1$ contains
the points of $SKY_1$ that can be disqualified by $(H, M)$.
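
The worked example can be checked mechanically. The sketch below
recomputes $SKY_1$ and $SKY_2$ by brute force on Table 1, applies the
merging formula, and compares the result with a brute-force skyline
for the second-order preference. The helper names and the list
encoding of implicit preferences (listed values first, the rest
standing for $*$) are assumptions of this illustration.

```python
# Table 1: name -> (price, hotel_class, hotel_group)
PACKAGES = {
    "a": (1600, 4, "T"), "b": (2400, 1, "T"), "c": (3000, 5, "H"),
    "d": (3600, 4, "H"), "e": (2400, 2, "M"), "f": (3000, 3, "M"),
}


def pref_leq(u, v, listed):
    """u is at least as preferable as v under 'listed[0] < listed[1] < ... < *'."""
    if u == v:
        return True
    if u not in listed:
        return False
    return v not in listed or listed.index(u) < listed.index(v)


def dominates(p, q, listed):
    return (p != q and p[0] <= q[0] and p[1] >= q[1]
            and pref_leq(p[2], q[2], listed))


def skyline(listed):
    return {n for n, q in PACKAGES.items()
            if not any(dominates(p, q, listed) for p in PACKAGES.values())}


def merge(sky1, sky2, prefix_values):
    """(SKY1 intersect SKY2) union PSKY1, as in Theorem 2."""
    psky1 = {n for n in sky1 if PACKAGES[n][2] in prefix_values}
    return (sky1 & sky2) | psky1


sky1 = skyline(["M"])              # M before everything else
sky2 = skyline(["H"])              # H before everything else
sky3 = merge(sky1, sky2, {"M"})    # M, then H, then everything else
```

On this data, sky3 coincides with skyline(["M", "H"]), in line with
Figure 1.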

From Theorem 2, we can derive a powerful tool for the computation of
the skyline with respect to any implicit preference of any order: we
build the skyline of an increasingly higher-order refinement
($\tilde{R}'''$ in the theorem) from lower-order ones ($\tilde{R}'$
and $\tilde{R}''$), starting with the first order. In the following
two subsections, we introduce the IPO-tree for storing the
first-order preference skylines and the query evaluation based on the
IPO-tree.



3.1 Tree Construction 

An IPO-tree (implicit preference order tree) stores results for
combinations of first-order preferences. In this tree, each node is
labeled with a first-order implicit preference, namely
"$v \prec *$", where $v \in D_i$ and $D_i$ is a nominal dimension.
The tree is of depth $m' + 1$, where $m'$ is the number of nominal
attributes. The root node stores the skyline $SKY(R)$ with respect to
template $R$ in $\mathcal{D}$. The second level contains all nodes
corresponding to first-order implicit preferences on nominal
attribute $D_1$. In general, the children of an $i$-th level node
correspond to all the first-order implicit preferences on nominal
attribute $D_i$. A special child node is labeled $\phi$,
corresponding to no preference. Each non-root node has a label
associated with a first-order implicit preference on a single nominal
attribute, and maintains results that correspond to the labels along
the path to the root node. Figure 2 shows an IPO-tree built from the
data in Table 3, where the template $R$ is set to $\emptyset$. Node 6
corresponds to the implicit preferences "$T \prec *, G \prec *$".

Furthermore, the root node is associated with a set $S = SKY(R)$,
while each non-root node is associated with a set $A$ of points such
that $S - A$ is the skyline for the corresponding implicit
preference. Therefore, $A$ contains the points in $SKY(R)$ that are
disqualified from the skyline at the node because of the preference
refinement. For example, in the IPO-tree shown in Figure 2, Node 6
corresponds to the implicit preference "$T \prec *, G \prec *$",
which disqualifies points $d$, $e$, $f$ in $S$ as skyline points, so
$A$ of Node 6 is equal to $\{d, e, f\}$. The purpose of $A$ is to
allow us to find the skyline for the node given the skylines of the
ancestors. It is also possible to store the exact skyline at each
node instead.
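
As an illustration of what a node stores, the set $A$ of Node 6 can
be obtained by brute force from Table 3. This is a sketch under the
assumption of an empty template; it does not use the MDC-based
construction described next, and all names in it are hypothetical.

```python
# Table 3: name -> (price, hotel_class, hotel_group, airline)
DATA = {
    "a": (1600, 4, "T", "G"), "b": (2400, 1, "T", "G"), "c": (3000, 5, "H", "G"),
    "d": (3600, 4, "H", "R"), "e": (2400, 2, "M", "R"), "f": (3000, 3, "M", "W"),
}


def pref_leq(u, v, listed):
    """u is at least as preferable as v under 'listed[0] < listed[1] < ... < *'."""
    if u == v:
        return True
    if u not in listed:
        return False
    return v not in listed or listed.index(u) < listed.index(v)


def dominates(p, q, pref):
    """pref = (listed hotel groups, listed airlines)."""
    return (p != q and p[0] <= q[0] and p[1] >= q[1]
            and pref_leq(p[2], q[2], pref[0])
            and pref_leq(p[3], q[3], pref[1]))


def skyline(pref):
    return {n for n, q in DATA.items()
            if not any(dominates(p, q, pref) for p in DATA.values())}


root_s = skyline(([], []))           # S at the root: empty template
node6_sky = skyline((["T"], ["G"]))  # T first on Hotel-group, G first on Airline
node6_a = root_s - node6_sky         # disqualified set A stored at Node 6
```

Here root_s is {a, c, d, e, f} and node6_a is {d, e, f}, matching the
values quoted for Figure 2.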

Implementation: In order to find the set $A$ for each non-root node
$N$, one can apply a skyline algorithm (e.g., adaptive SFS in
Section 4). However, in our implementation, we make use of the
minimal disqualifying conditions introduced in [20]. For a skyline
point $p$ and a template order $R$, a partial order $R'$ is called a
minimal disqualifying condition (or MDC for short) if
(1) $R' \cap R = \emptyset$, (2) $R'$ and $R$ are conflict-free,
(3) $p$ is not a skyline point with respect to $R \cup R'$, and
(4) there exists no $R''$ such that $R'' \subset R'$ and $p$ is not a
skyline point with respect to $R \cup R''$. The set of minimal
disqualifying conditions for $p$ is denoted by $MDC(p)$. The first
step here is to find all MDCs of each skyline point in $SKY(R)$. One
of the algorithms in [20] can be used for this step. Then, given the
implicit preference $\tilde{R}'$ corresponding to a node $N$, we
check each point in $SKY(R)$: if any of its MDCs is a subset of
$\mathcal{P}(\tilde{R}')$, then the point is disqualified and is
inserted into $A$.

Figure 2. Illustration of an implicit preference order tree
(IPO-tree)

Figure 3. Query evaluation with an IPO-tree

Tree Size: Let $m'$ be the number of nominal attributes and $c$ be
the maximum cardinality of a nominal attribute. The height of the
IPO-tree is $m' + 1$. The size of the tree in number of nodes is
given by $O(c^{m'})$. As claimed in [13] and quoted in [15], most
applications involve up to five attributes, and hence $m'$ is very
small. Note that the IPO-tree size is significantly smaller than the
number of possible implicit preferences, which is given by
$O((c \cdot c!)^{m'})$.
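
To give a feeling for the gap between the two quantities, the
following sketch counts both for concrete values. It assumes that
every node has $c + 1$ children (one per value plus the $\phi$
child) and that an implicit preference on one attribute is any
ordered listing of $x$ distinct values with $0 \le x \le c$; both
function names are hypothetical.

```python
from math import perm


def ipo_tree_nodes(c, m):
    """Nodes of a complete tree with m nominal levels and c + 1 children per node."""
    return sum((c + 1) ** level for level in range(m + 1))


def implicit_preferences(c, m):
    """Ordered listings of x distinct values (0 <= x <= c) on each of m attributes."""
    per_attribute = sum(perm(c, x) for x in range(c + 1))
    return per_attribute ** m


nodes_default = ipo_tree_nodes(20, 2)          # 2 nominal attributes, 20 values each
prefs_default = implicit_preferences(20, 2)
```

For two nominal attributes with 20 values each, nodes_default is 463,
whereas prefs_default exceeds $10^{30}$ under the counting assumption
stated above.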

The tree size can be further controlled if we know the 
query pattern (e.g., from a history of user queries). Typi- 
cally, there are popular and unpopular values. For values 
which are seldom or never chosen in implicit preferences, 
the corresponding tree nodes in the IPO-tree are not needed. 
It is possible to restrict the IPO-tree to say the 10 most pop- 
ular values for each nominal attribute. If a query contain- 
ing unpopular values arrives, the adaptive SFS algorithm in 
Section 4 can be used instead. 

3.2 Query Evaluation 

IPO-tree has a nice structure with a well-controlled tree 
size and can efficiently facilitate implicit preference query- 
ing based on the merging property (Theorem 2). Algo- 
rithm 1 shows the evaluation of a query with an implicit 
preference R' . 



Algorithm 1 query($d$, $\tilde{R}'$, $N$, $S$)

Input: dimension $d$, implicit preference $\tilde{R}'$, tree node
$N$, set of potential skyline points $S$
Local variable: $Q$ - a queue containing sets of points
1: $X \leftarrow S$
2: if $d \le m'$ then
3:    if $\tilde{R}'_d$ contains no preferences then
4:       $N_c \leftarrow$ the child node of $N$ labeled $\phi$
5:       $X \leftarrow$ query($d + 1$, $\tilde{R}'$, $N_c$, $S$)
6:    else
7:       $Q \leftarrow \emptyset$
8:       for $i := 1$ to $order(\tilde{R}'_d)$ do
9:          $v \leftarrow$ the $i$-th entry in $\tilde{R}'_d$
10:         $N_c \leftarrow$ child node of $N$ labeled with "$v \prec *$"
11:         $A \leftarrow$ the disqualifying set of $N_c$
12:         $Y \leftarrow$ query($d + 1$, $\tilde{R}'$, $N_c$, $S - A$)
13:         enqueue $Y$ to $Q$
14:      $X \leftarrow$ merge($d$, $Q$, $\tilde{R}'$) (see Algorithm 2)
15: return $X$



Algorithm 2 merge($d$, $Q$, $\tilde{R}'$)

Input: dimension $d$, $Q$ storing sets of points, preference
$\tilde{R}'$
1: dequeue $Q$ and obtain the dequeued element $Y$
2: $X \leftarrow Y$
3: for $i := 2$ to $order(\tilde{R}'_d)$ do
4:    dequeue $Q$ and obtain the dequeued element $Y$
5:    let $T$ be the set of the first to the $(i - 1)$-th entries in $\tilde{R}'_d$
6:    $Z \leftarrow$ the set of points $p$ in $X$ with $p.D_d \in T$
7:    $X \leftarrow (X \cap Y) \cup Z$
8: return $X$

Example 1 (Query Evaluation) We use the IPO-tree in Figure 2 to
illustrate the detailed steps in implicit preference query
evaluation. Let us consider four different queries, namely
$Q_a$: "$M \prec *$", $Q_b$: "$M \prec *, G \prec *$",
$Q_c$: "$M \prec H \prec *, G \prec *$" and
$Q_d$: "$M \prec H \prec *, G \prec R \prec *$".

Consider $Q_a$. We first visit Node 1 and $X$ is set to be $S$ of
Node 1 (i.e., $\{a, c, d, e, f\}$). Node 4 is then visited, where $A$
is $\emptyset$; $X$ is still $\{a, c, d, e, f\}$, which is the
skyline for $Q_a$.

Consider $Q_b$. After visiting Node 1, $X = \{a, c, d, e, f\}$. Next,
Node 4 and Node 14 are visited. The skyline is
$X = \{a, c, d, e, f\} - \{d\} = \{a, c, e, f\}$.

Consider $Q_c$. We split the query into subqueries
"$M \prec *, G \prec *$" and "$H \prec *, G \prec *$" with respective
skylines of $\{a, c, e, f\}$ and $\{a, c, e\}$. The subset $PSKY_1$
of $SKY_1$ with Hotel-group value $M$ is $\{e, f\}$. By Theorem 2,
the resulting skyline is
$(\{a, c, e, f\} \cap \{a, c, e\}) \cup \{e, f\} = \{a, c, e, f\}$.

Consider $Q_d$. As illustrated in Figure 3, we follow the breakdown
and obtain the skyline with respect to $Q_d$ equal to
$\{a, c, e, f\}$.

Theorem 3 With Algorithm 1, query($1$, $\tilde{R}'$, Root, $SKY(R)$)
returns $SKY(\tilde{R}')$, given a template $R$ for a dataset
$\mathcal{D}$ and the corresponding IPO-tree with a root node of
Root.

The number of leaf nodes in a query evaluation tree diagram such as
the one shown in Figure 3 gives a bound on the number of set
operations. The number of set operations required for an $x$-th order
implicit preference is $O(x^{m'})$. Since $x$ and $m'$ are very
small, this number is also small.
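
Algorithms 1 and 2 can be sketched in Python on the Table 3 data as
follows. This is an illustrative reading rather than the
implementation used in the experiments: the tree is not materialized,
the disqualifying set of a node is recomputed by brute force on
demand (function disqualified), dimensions are indexed from 0, and
every name in the fragment is hypothetical.

```python
from collections import deque

# Table 3: name -> (price, hotel_class, hotel_group, airline)
DATA = {
    "a": (1600, 4, "T", "G"), "b": (2400, 1, "T", "G"), "c": (3000, 5, "H", "G"),
    "d": (3600, 4, "H", "R"), "e": (2400, 2, "M", "R"), "f": (3000, 3, "M", "W"),
}
NOMINAL = (2, 3)          # tuple positions of the nominal attributes


def pref_leq(u, v, listed):
    """u is at least as preferable as v under 'listed[0] < listed[1] < ... < *'."""
    if u == v:
        return True
    if u not in listed:
        return False
    return v not in listed or listed.index(u) < listed.index(v)


def dominates(p, q, pref):
    return (p != q and p[0] <= q[0] and p[1] >= q[1]
            and all(pref_leq(p[col], q[col], pref[k])
                    for k, col in enumerate(NOMINAL)))


def brute_skyline(pref):
    return {n for n, q in DATA.items()
            if not any(dominates(p, q, pref) for p in DATA.values())}


ROOT = brute_skyline(([], []))       # S = SKY(R) with an empty template


def disqualified(path):
    """Set A of the node reached by path (one first-order value or None per level)."""
    pref = tuple([] if v is None else [v] for v in path)
    pref += ([],) * (len(NOMINAL) - len(path))
    return ROOT - brute_skyline(pref)


def merge(d, queue, pref):
    """Algorithm 2: fold the per-entry results with the merging property."""
    x = queue.popleft()
    for i in range(1, len(pref[d])):
        y = queue.popleft()
        prefix = set(pref[d][:i])                        # the first i entries
        z = {n for n in x if DATA[n][NOMINAL[d]] in prefix}
        x = (x & y) | z
    return x


def query(d, pref, path, s):
    """Algorithm 1, with a 0-based dimension index d."""
    if d == len(NOMINAL):
        return s
    if not pref[d]:                                      # follow the phi child
        return query(d + 1, pref, path + (None,), s)
    queue = deque()
    for v in pref[d]:
        child = path + (v,)
        queue.append(query(d + 1, pref, child, s - disqualified(child)))
    return merge(d, queue, pref)


q_d = query(0, (["M", "H"], ["G", "R"]), (), ROOT)
```

With these assumptions, q_d is {a, c, e, f}, the skyline reported for
$Q_d$ in Example 1, and it agrees with a brute-force skyline for the
same preference.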



Implementation: We have implemented the algorithm by accumulating the
set of disqualified points. By Theorem 2, if $A(\tilde{R}')$ and
$A(\tilde{R}'')$ are the sets of disqualified points for $\tilde{R}'$
and $\tilde{R}''$, respectively, and $B$ is the set of points in
$A(\tilde{R}'')$ with $D_i$ values in $\{v_1, \ldots, v_{x-1}\}$,
then the accumulated set of disqualified points for $\tilde{R}'''$ is
given by $A(\tilde{R}') \cup (A(\tilde{R}'') - B)$.

Another efficient implementation is to store the skyline 
for each node in the IPO-tree by means of a bitmap (re- 
placing A) and to create an inverted list for each nomi- 
nal attribute for an easy lookup to determine a bitmap for 
PSKY{R') (see Theorem 2). Efficient bitwise operations 
can then be used for the set operations. 
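
A minimal sketch of the bitmap idea, using Python integers as
bitmaps over the six packages of Table 1, is given below. The bit
layout, the inverted list INVERTED and the function names are
assumptions for illustration only.

```python
ORDER = "abcdef"                     # bit i of a bitmap stands for ORDER[i]
GROUP = {"a": "T", "b": "T", "c": "H", "d": "H", "e": "M", "f": "M"}


def to_bitmap(names):
    return sum(1 << ORDER.index(n) for n in names)


def from_bitmap(bits):
    return {n for i, n in enumerate(ORDER) if bits >> i & 1}


# Inverted list: nominal value -> bitmap of the points carrying that value.
INVERTED = {}
for name, value in GROUP.items():
    INVERTED[value] = INVERTED.get(value, 0) | to_bitmap(name)


def merge_bitmaps(sky1, sky2, prefix_values):
    """(SKY1 AND SKY2) OR PSKY1, where PSKY1 = SKY1 AND mask(prefix values)."""
    mask = 0
    for v in prefix_values:
        mask |= INVERTED.get(v, 0)
    return (sky1 & sky2) | (sky1 & mask)


sky3_bits = merge_bitmaps(to_bitmap("acef"), to_bitmap("ace"), ["M"])
```

Decoding sky3_bits with from_bitmap gives {a, c, e, f}, the same
result as the set-based merge in the Figure 1 example.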

4 Progressive Algorithm: Adaptive SFS 

The IPO-tree method requires much preprocessing cost 
and storage. It is also more appropriate for more static 
datasets since changes in the datasets require rebuilding the 
entries in the tree. It is of interest to find an efficient al- 
gorithm which does not involve major overheads, and in 
addition allows incremental maintenance to accommodate 
dynamic updating of the datasets. Here, we propose such a 
method for real-time querying which is based on the Sort- 
First Skyline Algorithm (SFS) [7]. The algorithm is called 
Adaptive SFS and is efficient since it does not require a 
complete resorting of the data for each different user pref- 
erence. It also allows skyline points to be returned in a pro- 
gressive manner. 

4.1 Overview of SFS 

First, we briefly describe the method of Sort-First Skyline (SFS),
which is for totally-ordered numerical attributes. With SFS, the data
points are sorted according to their scores obtained by a preference
function $f$, which can be the sum of all the numeric values in
different dimensions of a data point. That is, the score of a point
$p$ is $f(p) = \sum_{i=1}^{m} p.D_i$. The criterion for the function
is that if $p \prec q$, then $f(p) < f(q)$. The data points are then
examined in ascending order of their scores. A skyline list $L$ is
initially empty. If a point is not dominated by any point in $L$,
then it is inserted into $L$. The sorting takes $O(N \log N)$ time
while the scanning of the sorted list to generate the skyline points
takes $O(N \cdot n)$ time, where $N$ is the number of data points in
the data set and $n$ is the size of the skyline.
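
The SFS procedure described above can be sketched in a few lines,
assuming purely numeric points for which smaller values are better
on every dimension. The function names are hypothetical and the
fragment is not taken from [7].

```python
def dominates(p, q):
    """p is no larger than q on every dimension and differs from q."""
    return p != q and all(a <= b for a, b in zip(p, q))


def sfs_skyline(points):
    """Sort by the score f(p) = sum of coordinates, then scan once."""
    skyline = []
    for q in sorted(points, key=sum):
        # Only points already in the list can dominate q, because a
        # dominating point always has a strictly smaller score.
        if not any(dominates(p, q) for p in skyline):
            skyline.append(q)
    return skyline


POINTS = [(3, 3), (1, 4), (4, 1), (2, 2), (5, 5), (2, 5)]
result = sfs_skyline(POINTS)
```

Each point is compared only with the skyline points found so far,
which is what makes every insertion into the list final.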

4.2 Adaptive SFS for Implicit Preferences 

Next, we develop an adaptive SFS method for query processing in the
data set with implicit preferences on nominal attributes, given the
skyline set $SKY(R)$ for a template order $R$ which is implicit. Let
$\tilde{R}'$ be an implicit refinement over $R$. From Theorem 1, any
skyline point $p$ for $\tilde{R}'$ will also be a skyline point for
$R$. Hence, in order to look for the skyline for $\tilde{R}'$, we
only need to search $SKY(R)$.



Algorithm 3 Preprocessing

1: Compute the skyline set $SKY(R)$ for the given template $R$
2: Determine the ranking $r$ based on $SKY(R)$ and $f$
3: Apply the presorting step of SFS based on $r$ on $SKY(R)$



Algorithm 4 Query Processing

Input: skyline query, with implicit preference $\tilde{R}'$
1: Determine the ranking for the values in $\tilde{R}'$
2: Find the data points in $SKY(R)$ that contain values in
   $\tilde{R}'$. Alter the rankings for such data points if necessary
3: Delete the points with altered rankings from the sorted list
4: Re-insert the points just deleted using the new ranking
5: Apply the skyline extraction step of SFS on the resulting sorted
   list



Our idea is the following. We adopt the basic presorting step on
$SKY(R)$, resulting in a sorted list $L(R)$. When a query with a
refinement $\tilde{R}'$ arrives, we first try to re-sort the list
$L(R)$ and obtain a new sorted list $L(\tilde{R}')$. The skyline
generation step is then applied on $L(\tilde{R}')$. The key to the
efficiency is that the re-sorting step complexity is $O(l \log n)$,
where $l$ is the number of data points affected by the refinement
$\tilde{R}'$ and is typically much smaller than $n$. Next, we give a
more detailed description of the algorithm.

Each value $v$ in a dimension $D_i$ is associated with a rank denoted
by $r(v)$. In a totally-ordered attribute $D_i$, we define
$r(v) = v$ for each $v$ in $D_i$. Without loss of generality, we
assume that a smaller value in a dimension $D_i$ is more preferable
than a larger value in the same dimension. For a nominal attribute
$D_i$, we assign $r(v)$ as follows. Let $c_i$ be the cardinality of
nominal dimension $D_i$. By default, for each value $v$ in dimension
$D_i$, $r(v) = c_i$. For example, if there are 10 different values in
dimension $D_i$, then by default $r(v) = 10$ for each $v$ in $D_i$.
Given an implicit partial order $\tilde{R}'_i$, we can determine a
ranking for the values that appear in $\tilde{R}'_i$ so that
$r(v) < r(v')$ if and only if $v \prec v'$ can be derived from
$\tilde{R}'_i$. If $\tilde{R}'_i$ is
"$v_1 \prec v_2 \prec \cdots \prec v_x \prec *$", then we set
$r(v_1) = 1$, $r(v_2) = 2$, ..., $r(v_x) = x$. We define
$f(p) = \sum_{i=1}^{m} r(p.D_i)$.

Let $l$ be the number of data points that contain some values in
$\tilde{R}'$. The processing time for re-sorting the list is
$O(l \log n)$. Algorithms 3 and 4 show the steps for preprocessing
the data points and query processing, respectively.

In Step 2 in Algorithm 4, in order to find data points in $SKY(R)$
that contain values in $\tilde{R}'$, one possible way is to have an
index for each nominal dimension. The index can be a simple sorted
list or a more sophisticated tree index. An index lookup can quickly
return the points that contain a particular value in $\tilde{R}'$.
Such data points are collected in a set. Then, for each point $p$ in
the set, the value of $f(p)$ based on $R$ allows us to quickly locate
the point in the sorted list. The point is deleted from the list and
re-inserted with a new value for $f(p)$ based on the refinement
$\tilde{R}'$.

For the last step of the query processing, there is no need to follow
the SFS from scratch. Instead, we re-insert the points in ascending
order of the new $f(p)$ values. When a point $a$ is re-inserted, we
need only check whether it is dominated by the $\tilde{R}'$ skyline
points sorted before it. If so, $a$ is not added; otherwise, we then
check whether it dominates any $SKY(R)$ skyline points that are
sorted after it. The points that it dominates are removed. Let
$c = |SKY(\tilde{R}')|$, $n = |SKY(R)|$, and $l$ be the number of
points in $SKY(R)$ containing values in $\tilde{R}'$. The time
complexity of this step is
$O(l \log l + c \cdot l + \min(c, l) \cdot n)$. Since the re-sorting
step takes $O(l \log n)$ time, the total time is
$O(l \log n + \min(c, l) \cdot n)$.
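
The re-sorting idea can be sketched on Table 1 as follows. This is a
simplified illustration under several assumptions: the template
carries no nominal preference, hotel-class is turned into a
"smaller is better" rank via MAX_CLASS - class, affected points are
found by a linear scan rather than an index, and the final step
simply rescans the re-sorted list instead of performing the
incremental check described above. A Python list is used for
brevity, so deletions are not logarithmic here. All names are
hypothetical.

```python
import bisect

# Table 1: name -> (price, hotel_class, hotel_group)
PACKAGES = {
    "a": (1600, 4, "T"), "b": (2400, 1, "T"), "c": (3000, 5, "H"),
    "d": (3600, 4, "H"), "e": (2400, 2, "M"), "f": (3000, 3, "M"),
}
GROUPS = ("T", "H", "M")
MAX_CLASS = 5


def pref_leq(u, v, listed):
    """u is at least as preferable as v under 'listed[0] < listed[1] < ... < *'."""
    if u == v:
        return True
    if u not in listed:
        return False
    return v not in listed or listed.index(u) < listed.index(v)


def dominates(p, q, listed):
    return (p != q and p[0] <= q[0] and p[1] >= q[1]
            and pref_leq(p[2], q[2], listed))


def score(point, listed):
    """f(p): sum of ranks; unlisted nominal values get the default rank c_i."""
    price, hotel_class, group = point
    rank = listed.index(group) + 1 if group in listed else len(GROUPS)
    return price + (MAX_CLASS - hotel_class) + rank


def adaptive_sfs(template_sorted, listed):
    entries = list(template_sorted)          # [(score, name)], sorted under the template
    affected = [e for e in entries if PACKAGES[e[1]][2] in listed]
    for entry in affected:                   # delete points whose rank may change
        del entries[bisect.bisect_left(entries, entry)]
    for _, name in affected:                 # re-insert them with the new score
        bisect.insort(entries, (score(PACKAGES[name], listed), name))
    result = []
    for _, name in entries:                  # SFS extraction on the re-sorted list
        if not any(dominates(PACKAGES[r], PACKAGES[name], listed) for r in result):
            result.append(name)
    return result


TEMPLATE_SKYLINE = ["a", "c", "e", "f"]      # SKY(R) of Table 1 without nominal preference
template_sorted = sorted((score(PACKAGES[n], []), n) for n in TEMPLATE_SKYLINE)
chris = adaptive_sfs(template_sorted, ["H", "M"])
alice = adaptive_sfs(template_sorted, ["T", "M"])
```

Under these assumptions, chris contains a, e and c, and alice
contains a and c, in agreement with Table 2.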

4.3 Properties of Adaptive SFS 

The presorting ensures that a point p dominating another 
point q must be visited before q. This leads to a progressive 
behavior, meaning that any point inserted into the skyline 
list L must be in the skyline set, and it can be reported im- 
mediately. The presorting also enhances the pruning since it 
is more likely that candidate points with lower scores domi- 
nate more other points. Another desirable property of adap- 
tive SFS is that it allows incremental maintenance. Assume 
that the algorithm which finds SKY{R) is incremental. Af- 
ter data is updated, the set SKY{R) is modified. The sorted 
list in the method is altered by simple insertions or dele- 
tions. The time complexity is O(logn) for each such up- 
date. 

5 Empirical Study 

We have conducted extensive experiments on a Pentium IV 3.2GHz PC
with 2GB memory, on a Linux platform. The algorithms were implemented
in C/C++. In our experiments, we adopted the data set generator
released by the authors of [20], which contains both numeric
attributes and nominal attributes, where the nominal attributes are
generated according to a Zipfian distribution. The default values of
the experimental parameters are shown in Table 4. In the experiment,
if the order of the implicit preference $\tilde{R}'$ is set to $x$,
it means that the order of $\tilde{R}'_i$ for each nominal attribute
$D_i$ is $x$. Note that the total number of dimensions is equal to
the number of numeric dimensions plus the number of nominal
dimensions. By default, we adopted a template where the most frequent
value in a nominal dimension has a higher preference than all other
values. This corresponds to a more difficult setting as the skyline
tends to be bigger. In the following, we use the default settings
unless specified otherwise.

Parameter                               Default value
No. of tuples                           500K
No. of numeric dimensions               3
No. of nominal dimensions               2
No. of values in a nominal dimension    20
Zipfian parameter                       1
Order of implicit preference            3

Table 4. Default values

We denote our proposed partial materialization methods
(IPO Tree Search) by IPO Tree and IPO Tree-10, where IPO
Tree is constructed based on all possible nominal values
and IPO Tree-10 is constructed based on only the 10 most
frequent values for each nominal attribute. We denote the
Adaptive SFS algorithm by SFS-A. We also compare our
proposed methods with a baseline algorithm called SFS-D,
which is the original SFS algorithm [7] returning SKY(R')
with respect to the implicit preference R' for dataset D.
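
To make the setup concrete, the following Python sketch draws a nominal column from a Zipfian distribution and selects the most frequent values, as an IPO Tree-10 style index would. It is not the generator of [20]: the weight 1/k^theta for the k-th value, the value names `v1`, `v2`, ..., and all function names are assumptions made for illustration.

```python
import random
from collections import Counter


def zipf_weights(cardinality, theta=1.0):
    """Unnormalised Zipfian weights 1/k^theta for k = 1..cardinality."""
    return [1.0 / (k ** theta) for k in range(1, cardinality + 1)]


def sample_nominal_column(n, cardinality=20, theta=1.0, seed=0):
    """Draw n nominal values v1..v<cardinality> with Zipfian frequencies."""
    rng = random.Random(seed)
    values = [f"v{k}" for k in range(1, cardinality + 1)]
    return rng.choices(values, weights=zipf_weights(cardinality, theta), k=n)


def most_frequent_values(column, k=10):
    """Values that a top-k restricted index would be built on."""
    return [v for v, _ in Counter(column).most_common(k)]
```

A query whose implicit preference mentions only values returned by `most_frequent_values` could then be served by the restricted tree, while other queries would need a fallback such as SFS-A.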

We evaluate the performance of the algorithms in terms
of (1) pre-processing time, (2) the query time of an
implicit preference and (3) memory requirement. We also
report (4) the proportion of skyline points with respect
to the template R (i.e., |SKY(R)|/|D|), (5) the proportion
of skyline points in SKY(R) affected by R'
(i.e., |AFFECT(R)|/|SKY(R)|), where AFFECT(R)
is the set of skyline points in SKY(R) with values in
R', and (6) the proportion of skyline points with respect
to R' in SKY(R) (i.e., |SKY(R')|/|SKY(R)|). For
pre-processing, both IPO Tree and IPO Tree-10 compute
SKY(R) and build the corresponding IPO trees, and SFS-A
computes SKY(R) and pre-sorts the data according to the
preference function f. Note that SFS-D does not require
any preprocessing. The storage of IPO Tree or IPO Tree-10
corresponds to the IPO tree stored. SFS-A stores the sorted
data in SKY(R), and SFS-D does not use extra storage but
reads the data directly from the dataset.
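
Measurements (4) to (6) are simple ratios. The sketch below shows one way to compute them in Python, assuming points are tuples and that `listed` maps the index of each nominal attribute to the set of values mentioned in R'; these names and this representation are hypothetical.

```python
def skyline_proportions(dataset, sky_r, sky_r_prime, listed):
    """Return the ratios (4), (5) and (6) described in the text."""
    # AFFECT(R): skyline points carrying at least one value listed in R'.
    affect = [p for p in sky_r
              if any(p[i] in vals for i, vals in listed.items())]
    return {
        "sky_r_over_d": len(sky_r) / len(dataset),
        "affect_over_sky_r": len(affect) / len(sky_r),
        "sky_r_prime_over_sky_r": len(sky_r_prime) / len(sky_r),
    }
```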

For measurements (1) and (3), each experiment was
conducted 100 times and the average of the results is
reported. For measurements (2), (4), (5) and (6), in each
experiment, we randomly generated 100 implicit preferences
and report the average over these preferences. We study the
effects of varying (1) database size, (2) dimensionality, (3)
cardinality of nominal attribute and (4) order of implicit
preference.

5.1 Synthetic Data Set 

Three types of data sets are generated as described in [1]:
(1) independent data sets, (2) correlated data sets and (3)
anti-correlated data sets. A detailed description of these
data sets can be found in [1]. In the interest of space, we only
show the experimental results for the anti-correlated data
sets. The results for the independent data sets and the
correlated data sets show similar trends, but their execution
times are much shorter.

Figure 4. Scalability with respect to database size

Figure 5. Scalability with respect to dimensionality where no. of numeric attributes is fixed to 3

Figure 6. Scalability with respect to cardinality of nominal attribute

Figure 7. Effect of order of implicit preference

Figure 8. Effect of order of implicit preference (real data set)

Effect of the database size: In Figure 4(d), we note that
|SKY(R)|/|D| decreases slightly when the data size
increases. This is because, when there are more data points,
there is a higher chance that a data point is dominated by
other data points. Nevertheless, |SKY(R)| increases with
database size, and therefore we see an upward trend in run
time and in storage. For the IPO tree methods, the size of the
skyline information increases with data size. For SFS-A, the
preprocessing time is O(N log N + N·n) and the query time
is O(l log n + min(c, l)·n), where N is the data size, l =
|AFFECT(R)|, c = |SKY(R')| and n = |SKY(R)|.
For SFS-D, the query time is O(N log N + N·n). We can
see that the results in the graphs match the complexity
analysis.

Effect of dimensionality: We study the effect of the
number of nominal attributes m', where the number of numeric
attributes is fixed to 3, with the results shown in
Figure 5. In Figure 5(d), |SKY(R)|/|D| increases. With
more nominal attributes, it is less likely that the data points
are dominated by others, and thus |SKY(R)| increases.
|AFFECT(R)|/|SKY(R)| also increases with m'
because it is more likely that a data point is affected when
the implicit preference contains preferences on more
nominal attributes. The number of nodes in a full IPO tree is
given by O(c^{m'}), where c is the cardinality of a nominal
attribute. Because of these factors, the preprocessing time
and the query time of all algorithms increase with m'. For
the same reason, the storage for IPO Tree and the storage of
SFS-A also increase slightly.

Effect of Cardinality of Nominal Attribute: Figure 6(d)
shows that |SKY(R)| increases with cardinality. This is
because, when the cardinality increases, there is a higher
chance that a data point is not dominated by other data
points. Also, the number of nodes in a full IPO tree is given
by O(c^{m'}), where c is the cardinality of a nominal attribute
and m' is the number of nominal attributes. Thus, the
pre-processing time, query time and storage of our proposed
algorithms increase with the cardinality. From Figure 6(b),
the increase is dampened for SFS-A because the query time
of SFS-A depends on |AFFECT(R)| and there is a
decrease in |AFFECT(R)|/|SKY(R)|, which is caused by
fewer data points having frequent nominal values when there
are more values in a nominal attribute.
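
As a rough illustration of the O(c^{m'}) growth, the default setting of Table 4 (c = 20, m' = 2) can be compared with restricting each attribute to its 10 most frequent values. The O-notation hides constant factors, so these figures indicate growth only and are not actual node counts; the case m' = 3 is a hypothetical setting added for comparison.

```latex
c^{m'} = 20^{2} = 400, \qquad 10^{2} = 100, \qquad
\text{and for } m' = 3:\quad 20^{3} = 8000, \qquad 10^{3} = 1000 .
```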

Effect of Order of Implicit Preference: For IPO tree, the
number of set operations is given by O(x^{m'}), where x is
the order of implicit preference. Hence, in Figure 7(b), the
query time for IPO Tree increases. The query times for SFS-A
and SFS-D drop slightly because the skyline size
decreases when the order of implicit preference increases.
Clearly, neither the pre-processing time nor the storage
is affected. Figure 7(d) shows that the number of affected
skyline points increases. This is because more nominal values
involved in the preference affect more data points.

5.2 Real Data Set 

To demonstrate the usefulness of our methods, we ran
our algorithms on a real data set, the Nursery data set, which
is publicly available from the UC Irvine Machine Learning
Repository. In this data set, there are 12,960 instances
and 8 attributes. The experimental setup is the same as in [20].
There are six totally ordered attributes and two nominal
attributes, namely the form of the family and the number of
children. (Note that although the number of children is a
numeric attribute, it is not clear whether a family with one
child is "better" than a family with two children.) The
cardinality of both nominal attributes is 4. The performance
results are similar to those for the synthetic data
sets. Figure 8 shows the results on the real data set for the
effect of the order of implicit preference.

5.3 Main Observations 

The major findings from the experiments are the following.
The SFS-D algorithm cannot meet real-time requirements,
since its query time is at least tens of
seconds and, in some cases, exceeds 1000 seconds. In
general, IPO Tree is the fastest, but SFS-A can also return the
result within a second in most cases and in under 20 seconds in
the worst case, and is orders of magnitude faster than SFS-D.
The results with IPO Tree-10 show that, by handling a
smaller set of nominal values, one can control both the
pre-processing and storage costs. A hybrid approach adopting
IPO Tree for popular values and SFS-A for handling queries
involving the remaining values is a sound solution.

6 Conclusion 

Most previous works on the skyline problem consider
data sets with attributes following a fixed ordering.
However, nominal attributes with dynamic orderings according
to different users exist in almost all conceivable real-life
applications. In this work, we study the problem of online
response for such dynamic preferences. Two methods are
proposed with different flavors: a semi-materialization method
and an adaptive SFS method. Our experiments show how
our proposed algorithms are useful in different problem
settings.

7 Appendix: Proof of Theorem 2

Proof: We need to show that a point p is in SKY(R''') if and
only if it is in (SKY(R') ∩ SKY(R'')) ∪ PSKY(R'). For each
direction, we prove by contradiction.

[A] Firstly, assume p is in SKY(R'''), and suppose that p is
not in (SKY(R') ∩ SKY(R'')) ∪ PSKY(R'). Then, by
Theorem 1, since p ∈ SKY(R''') and R''' is a refinement of R', we
deduce that p ∈ SKY(R'). Thus, p must satisfy the following:

• Condition 1: p.D_i ∉ {v_1, ..., v_{x-1}} and

• Condition 2: p ∉ SKY(R'').

Consider Condition 2. Since p ∉ SKY(R''), there exists a
data point q dominating p w.r.t. R''. In other words, with respect
to R'', q.D_k ⪯ p.D_k for all k and in at least one dimension D_j,
q.D_j ≺ p.D_j. Let J be the set of dimensions D_j where q.D_j ≺
p.D_j w.r.t. R''. Besides, for all dimensions D_k other than D_i, the
partial orders of R'' and R''' are the same. Hence, w.r.t. R''',
q.D_k ⪯ p.D_k for all k (≠ i). There are two subcases: Case (i):
D_i ∉ J and Case (ii): D_i ∈ J.

Case (i): D_i ∉ J. For all D_j ∈ J, since q.D_j ≺ p.D_j
w.r.t. R'' and the partial orders in R''_j are those in R'''_j, we have
q.D_j ≺ p.D_j w.r.t. R'''. Also, w.r.t. R''', q.D_k ⪯ p.D_k for all
k ≠ i. Hence, since D_i ∉ J, for dimension D_i, it must be the case
that p.D_i ≺ q.D_i w.r.t. R'''. Otherwise, p is dominated by q w.r.t.
R''', and p cannot be in SKY(R'''). Since p.D_i ≺ q.D_i w.r.t.
R''', we have p.D_i ≠ q.D_i. Since q.D_k ⪯ p.D_k w.r.t. R'' for
all k, and p.D_i ≠ q.D_i, we have q.D_i ≺ p.D_i w.r.t. R''. Since
the implicit preference in R'' is "v_x ≺ *", we conclude that p.D_i
cannot be v_x. Since R''' is "v_1 ≺ ... ≺ v_x ≺ *" and p.D_i ≺ q.D_i
w.r.t. R''', p.D_i must be in {v_1, ..., v_{x-1}}. However, this violates
Condition 1 discussed above. Hence, we arrive at a contradiction.

Case (ii): D_i ∈ J. We obtain q.D_i ≺ p.D_i w.r.t. R''.
Besides, since the implicit preference in R'' is "v_x ≺ *", q.D_i
must be equal to v_x and p.D_i cannot be equal to v_x. Since
p ∈ SKY(R'''), there is no other point, including q, dominating
p w.r.t. R'''. Note that, w.r.t. R''', q.D_k ⪯ p.D_k for all k (≠ i).
We obtain p.D_i ⪯ q.D_i w.r.t. R'''. (Otherwise, q.D_i ≺ p.D_i
w.r.t. R''' and p is dominated by q w.r.t. R''', which leads to a
contradiction.) Besides, since q.D_i = v_x, p.D_i ≠ v_x and R''' is
"v_1 ≺ ... ≺ v_x ≺ *", p.D_i must be in {v_1, ..., v_{x-1}}. However,
this violates Condition 1. Hence, we arrive at a contradiction.

[B] Conversely, consider a point p in (SKY(R') ∩
SKY(R'')) ∪ PSKY(R'). Suppose that p is not in SKY(R''').
Thus, p is dominated by some point q w.r.t. R'''. That is, w.r.t.
R''', q.D_k ⪯ p.D_k for all k and q.D_j ≺ p.D_j for at least one
dimension D_j.

Since p ∈ (SKY(R') ∩ SKY(R'')) ∪ PSKY(R'), we know
that at least one of the following two conditions holds.

• Condition 3: p.D_i ∈ {v_1, ..., v_{x-1}} and p ∈ SKY(R'), or

• Condition 4: p ∈ SKY(R') and p ∈ SKY(R'').

Consider Condition 3. Since p ∈ SKY(R') and p ∉
SKY(R'''), where R'''_i is a refinement of R'_i and R'''_k = R'_k for
all k ≠ i, we deduce that q.D_i ≺ p.D_i exists in the partial orders
of R''' but not in the partial orders of R'. Since q.D_i ≺ p.D_i w.r.t.
R''', p.D_i ∈ {v_1, ..., v_{x-1}} and R''' is "v_1 ≺ ... ≺ v_x ≺ *",
we deduce q.D_i ∈ {v_1, ..., v_{x-2}}. For each possible binary
order q.D_i ≺ p.D_i w.r.t. R''' where p.D_i ∈ {v_1, ..., v_{x-1}} and
q.D_i ∈ {v_1, ..., v_{x-2}}, we also conclude that q.D_i ≺ p.D_i exists
in the partial orders of R', which leads to a contradiction.

Consider Condition 4. Since R', R'' and R''' differ only at
dimension D_i, we only need to check their implicit preferences to
see that, whenever q.D_i ⪯ p.D_i (or q.D_i ≺ p.D_i) w.r.t. R''', it is
also true w.r.t. R' or R''. Therefore, q also dominates p w.r.t. R'
or R''. That is, p ∉ SKY(R') or p ∉ SKY(R''), which leads to
a contradiction. ■
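
The statement of Theorem 2 can also be checked numerically on small inputs. The following Python sketch compares both sides of the identity by brute force. It is only a sanity check under simplifying assumptions, not part of the proof: points are hypothetical tuples (num_1, ..., num_k, nominal) with one nominal attribute, smaller numeric values are better, and the template is assumed to impose no order on nominal values.

```python
import random


def dominates(q, p, pref, unlisted=10**9):
    """q dominates p under the implicit preference 'pref[0] < pref[1] < ... < *'."""
    rank = lambda v: pref.index(v) if v in pref else unlisted
    if any(a > b for a, b in zip(q[:-1], p[:-1])):
        return False
    # Distinct nominal values are comparable only via a strictly lower rank.
    if q[-1] != p[-1] and rank(q[-1]) >= rank(p[-1]):
        return False
    return q != p


def skyline(points, pref):
    """Brute-force skyline: keep the points that no other point dominates."""
    return {p for p in points
            if not any(dominates(q, p, pref) for q in points)}


def merging_property_holds(points, listed):
    """listed = [v_1, ..., v_x]; compare SKY(R''') with the merged sets."""
    r1, r2, r3 = listed[:-1], listed[-1:], listed   # R', R'', R'''
    sky1, sky2, sky3 = (skyline(points, r) for r in (r1, r2, r3))
    psky1 = {p for p in sky1 if p[-1] in r1}        # PSKY(R')
    return sky3 == (sky1 & sky2) | psky1


rng = random.Random(1)
pts = [(rng.randint(0, 9), rng.randint(0, 9), rng.choice("abcde"))
       for _ in range(200)]
print(merging_property_holds(pts, ["a", "b", "c"]))
```

The quadratic `skyline` function is deliberately naive so that the check does not depend on any of the optimized algorithms discussed in the paper.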



